Repeat performance tests up to 5 times on GHA #9000
memcpy never passes on my Linux machine: the Halide version is 3x slower than libc's memcpy. I compared the disassembly of the two. libc uses a ymm-vectorized move, unrolled ~5 times, with streaming stores and memory prefetching, so the loop looked like 6x prefetch, 6x load, 6x streaming store, repeat.
Maybe we should just delete it. It's true that a Halide memcpy shouldn't be inherently slower than a libc memcpy, but there are various reasons it might be on a given machine. The test is really asking "do we generate a sane inner loop for a memcpy?", but a sane inner loop might be a long way from the best inner loop on a particular machine. And if we didn't generate a sane inner loop for the most trivial pipeline possible, much more would be broken than just that test.
The GHA runners are very noisy and our performance tests aren't very stable anyway.
skip_buildbots because this is GHA workflow-only
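For readers unfamiliar with the approach, here is a minimal sketch of the "repeat up to 5 times" idea: a noisy test counts as passing if any attempt succeeds. This is not the actual workflow change in this PR; the function name `passes_within_attempts` and the `run_test` callable are hypothetical and only illustrate the retry logic.

```python
from typing import Callable, Tuple


def passes_within_attempts(
    run_test: Callable[[], bool], max_attempts: int = 5
) -> Tuple[bool, int]:
    """Run a test until it passes or max_attempts is exhausted.

    run_test is a hypothetical callable that returns True on a pass.
    Returns (passed, number_of_attempts_made).
    """
    for attempt in range(1, max_attempts + 1):
        if run_test():
            # Stop at the first success; later attempts are not needed.
            return True, attempt
    # Every attempt failed, so report a real failure.
    return False, max_attempts
```

A test that fails on every attempt is still reported as a failure, so retries only mask intermittent noise, not a consistent regression such as the 3x memcpy slowdown discussed above.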